ABSTRACT


This paper introduces several data analysis routines that were
designed for interactive use with APL (A Programming Language)
and placed in the APL user library at the Naval Postgraduate
School. Specifically, histogram, density estimation and
probability plotting routines are explained in detail and
demonstrated with actual data. In addition, the applications
and limitations of each routine are explored. Together, the
routines give the general user an extensive tool for analyzing
either discrete or continuous data.
I.   INTRODUCTION


The Naval Postgraduate School acquired APL (A Programming
Language) from IBM in 1974. Since that time more and more
students and faculty have become familiar with the extensive
and efficient capabilities of APL and have been putting these
features to good use. With the acquisition of APL came several
extensive library routines that are both well documented and
varied in scope. However, on close examination of these library
routines it was found that statistics and data analysis were
areas where some additions would be particularly useful.

Because of the efficiency and ease with which APL manipulates
vectors, matrices and arrays, it is ideal for data analysis.
After a thorough screening of the existing APL library routines
pertaining to data analysis, it was found that by adding six
data analysis routines to the present library, the Naval
Postgraduate School could enhance its APL capability and provide
the student and general user with a more varied and flexible
tool for analyzing data.

To this end, the purpose of this thesis is (1) to describe
completely the six data analysis routines added to the APL
library, (2) to explain the features and capabilities of each
routine and (3) to demonstrate the use of each routine with
"real world data".


The data used in this paper come from two different sources.
The first source is a series of tests performed jointly by IBM
Germany and the German Public Telephone Network on errors in the
transmission of binary data over telephone lines (Lewis & Cox,
1966). Two data sets are taken from this source, each containing
the times between errors in binary bits transmitted over
telephone lines. The first data set contains 672 elements
(times-between-errors: actually the number of bits between
errors) and will hereafter be referred to as "telephone data 1".
The second data set contains 736 elements and will be referred
to as "telephone data 2". The second source gives the percent
overrun or underrun on selected military contracts during the
year 1950 (Dixon, 1973). This data set contains 22 elements and
will be referred to as "cost overrun data".


II.   HISTOGRAM  ROUTINE 

A.   DESCRIPTION 

The first routine to be presented is the histogram routine,
which is used for estimating from given data the probability
density function f(x) of a continuous random variable. The
current APL library has several small histogram routines that
are general in nature but lack the overall detail necessary for
good data analysis. For this reason HIST (histogram routine) was
created. HIST is an adaptation and modification of the FORTRAN
library routine HISTG/F, which was developed at N.P.S. by D. R.
Robinson under the guidance of Professor P.A.W. Lewis. Adapting
HISTG/F to APL allows the power and efficiency of the APL
language to be put to full use.

A complete description of how HIST operates is contained in the
variable HISTHOW. If the user's APL workspace is properly loaded
(see section IX.B. for workspace loading procedures), all that
is necessary is to type HISTHOW. The user then receives the
following printed response on the terminal:

HISTHOW 

SYNTAX      HIST 

HIST ALLOWS YOU TO INTERACTIVELY OBTAIN A HISTOGRAM OF YOUR
DATA ALONG WITH A SET OF BASIC DESCRIPTIVE STATISTICS. IN
ADDITION, HIST HAS THE FOLLOWING CAPABILITIES WHICH ALLOW YOU:

(1)  THE OPTION OF A TITLE FOR YOUR HISTOGRAM

(2)  THE OPTION OF DISPLAYING A SMOOTHED EMPIRICAL DENSITY
FUNCTION OVER THE HISTOGRAM

(3)  THE OPTION OF SCALING AND SELECTING THE NUMBER OF
CELLS FOR YOUR HISTOGRAM

(4)  THE OPTION OF SELECTING AN INTERVAL AND PERFORMING A
HISTOGRAM ON ALL THE DATA POINTS OR CONDITIONALLY
SELECTING AN INTERVAL IN THE RANGE OF THE DATA.

(5)  THE OPTION OF HAVING YOUR OUTPUT APPEAR ON THE
OFFLINE PRINTER OR ON YOUR TERMINAL

WHEN YOU TYPE HIST YOU WILL BE ASKED TO DO THE FOLLOWING:

(1)  ENTER YOUR DATA IN VECTOR FORM - YOU CAN TYPE YOUR DATA
IN SINGLY OR YOU CAN TYPE THE NAME OF A VARIABLE THAT
HAS YOUR DATA IN IT. YOU MUST ENSURE THAT YOU HAVE AT
LEAST 10 DATA POINTS IN YOUR VECTOR AND THAT THERE IS
SOME DIFFERENCES IN THE DATA POINTS (MAX SIZE OF INTEGER
VECTOR IS APPROX. 2500, MAX SIZE OF REAL VECTOR IS
2000). AFTER YOU HAVE ENTERED YOUR DATA YOU WILL BE
ASKED

(2)  IF YOU DESIRE A SMOOTHED EMPIRICAL DENSITY FUNCTION OR
NOT. THE EMPIRICAL DENSITY FUNCTION WHEN PLOTTED GIVES
ESSENTIALLY A MORE EXACT PICTURE OF THE DATA THAN DOES
THE HISTOGRAM ALONE, ALTHOUGH THIS FEATURE IS SLIGHTLY
BLURRED BY THE PRECISION WHICH CAN BE OBTAINED WITH THE
APL BALL (THE APL FINE PLOT IS NOT PRESENTLY AVAILABLE
ON THE NPS SYSTEM). THE SMOOTHED EMPIRICAL DENSITY
IS DEFINED BY THE RELATION (LEWIS, LIU, ROBINSON, AND
ROSENBLATT, 1975; ROSENBLATT, 1956)


\[
F(Z) = \frac{1}{N \, B(N)} \sum_{I=1}^{N} W\!\left( \frac{X_I - Z}{B(N)} \right)
\]

WHERE N IS THE NUMBER OF DATA POINTS, B(N) IS A BANDWIDTH
FUNCTION,

\[
B(N) = \frac{\mathrm{RANGE}}{\sqrt{N}}
\]

AND W IS A WEIGHT FUNCTION,

\[
W(Z) =
\begin{cases}
0 & \text{if } |Z| > 1 \\
1 - |Z| & \text{otherwise.}
\end{cases}
\]
F(Z) IS COMPUTED FOR VALUES OF Z BETWEEN THE MAXIMUM
AND THE MINIMUM OF THE SAMPLE AND PLOTTED OVER THE
HISTOGRAM USING THE SYMBOL -F-. THE RELATIVE FREQUENCY
MARKS ON THE LEFT OF THE OUTPUT REFER TO THE HISTOGRAM,
AND NOT TO THE DENSITY FUNCTION. AFTER THIS QUERY YOU
WILL BE ASKED

(3)  IF YOU DESIRE TO TITLE YOUR HISTOGRAM. IF YOU ELECT TO
TITLE YOUR HISTOGRAM, SIMPLY TYPE YOUR TITLE, ENSURING
THAT YOUR TITLE IS MORE THAN ONE CHARACTER IN LENGTH.
IF NO TITLE IS DESIRED JUST HIT THE CARRIAGE RETURN.
AFTER THE TITLE QUERY YOU WILL BE ASKED

(4)  IF YOU WANT TO SET YOUR OWN SCALE AND THE NUMBER OF
CELLS. YOUR RESPONSE MUST BE A VECTOR OF 3 ELEMENTS.
THE FIRST ELEMENT IS THE NUMBER OF CELLS YOU DESIRE,
THIS MUST BE AN INTEGER BETWEEN 10 AND 28, THE
SECOND ELEMENT IS THE LEFT SCALE POINT AND THE THIRD
ELEMENT IS THE RIGHT SCALE POINT (HIST DOES NOT REQUIRE
THAT YOUR INTERVAL BE DIVISIBLE BY THE NUMBER OF CELLS).
IF YOU WANT HIST TO AUTOMATICALLY SCALE AND PICK THE
CELLS YOU SHOULD TYPE THE VECTOR 0 0 0 . AFTER YOU
HAVE SELECTED YOUR SCALING TECHNIQUE YOU WILL BE ASKED

(5)  IF YOU WANT DATA POINTS NOT INSIDE THE SCALE LIMITS
INCLUDED IN THE HISTOGRAM ROUTINE. MOST HISTOGRAMS LUMP
DATA POINTS THAT FALL OUTSIDE THE SCALE LIMITS IN THE
END CELLS. HOWEVER, HIST GIVES YOU THE OPTION OF
INCLUDING THEM OR EXCLUDING THEM, I.E. OF OBTAINING A
HISTOGRAM FOR THE CONDITIONAL DENSITY. AFTER YOUR
RESPONSE TO THIS QUERY YOU WILL BE ASKED

(6)  IF YOU WANT YOUR OUTPUT TO APPEAR ON THE OFFLINE PRINTER
OR ON YOUR TERMINAL. IF YOU SELECT THE OFFLINE PRINTER
THE NEXT RESPONSE YOU WILL RECEIVE ON YOUR TERMINAL IS
- HISTOGRAM SENT TO PRINTER -. THIS RESPONSE WILL TAKE
SEVERAL SECONDS AND AFTER IT IS RECEIVED YOUR TERMINAL
IS FREE FOR FURTHER USE. HOWEVER, IF YOU ELECTED TO
HAVE YOUR HISTOGRAM PRINTED ON YOUR TERMINAL THE
PRINTING WOULD BEGIN IN JUST A FEW SECONDS BUT WOULD
TAKE BETWEEN 5 AND 10 MINUTES TO COMPLETE.

THE FOLLOWING BASIC DESCRIPTIVE STATISTICS ARE COMPUTED
AND PRINTED OUT BY HIST.

MEAN, MEDIAN, TRIMEAN, MIDMEAN, MODE
GEOMETRIC AND HARMONIC MEANS (POSITIVE SAMPLES ONLY)
VARIANCE, STANDARD DEVIATION, COEFFICIENT OF VARIATION,
RANGE AND MIDSPREAD
THIRD AND FOURTH CENTRAL MOMENTS, COEFFICIENTS OF SKEWNESS
AND KURTOSIS
MAXIMUM, MINIMUM AND 5 SAMPLE QUANTILES
IN ADDITION, THE MEAN IS DISPLAYED ON THE HISTOGRAM BY A
VERTICAL COLUMN OF -M- AND THE QUARTILES BY COLUMNS OF
DOTS.

INTERPRETING    THE   OUTPUT 

THE DEFINITIONS OF THE BASIC STATISTICS COMPUTED BY HIST
ARE LISTED BELOW. PAGE NUMBER REFERENCES ARE TO THE CRC
STANDARD MATH TABLES, 19TH EDITION (1971).

MEAN        AVERAGE OF THE SAMPLE (P 554).

MEDIAN      MID-VALUE OF THE SAMPLE, IF THERE ARE AN ODD
            NUMBER OF SAMPLE POINTS, OR THE AVERAGE OF THE TWO
            MIDDLE VALUES FOR AN EVEN NUMBER OF POINTS (P 555).

SAMPLE      THE Q(1) = .25, Q(2) = .50, AND Q(3) = .75 POPULATION
QUARTILES   QUARTILES ARE THE SOLUTION TO THE EQUATION
            PROB (X < X(Q(I))) = Q(I), I = 1,2,3 . THE SAMPLE
            QUARTILES, WHICH ESTIMATE THE POPULATION QUARTILES,
            ARE THE JTH ORDERED VALUE IN THE SAMPLE, WHERE
            J = [ Q(I) x N ] + 1 , WHERE N = SAMPLE SIZE.

TRIMEAN     0.25 x (Q(1) + 2Q(2) + Q(3)), WHERE THE Q(I) ARE
            THE QUARTILES.

MIDMEAN     THE AVERAGE OF ALL THE SAMPLE VALUES BETWEEN THE
            UPPER AND LOWER QUARTILES.

MODE        THE DATA POINT THAT OCCURS MOST OFTEN (IF ALL THE
            DATA POINTS ARE DIFFERENT OR IF THERE ARE MORE
            THAN 300 DATA POINTS THE MODE WILL NOT BE PRINTED.
            IF TWO OR MORE MODES OCCUR HIST WILL PRINT THE
            FIRST MODE.)

MIDRANGE    AVERAGE OF THE MAXIMUM AND MINIMUM.

GEOMETRIC   (P 554).
MEAN

HARMONIC    (P 555).
MEAN

VARIANCE    (P 557). UNBIASED ESTIMATORS FOR VARIANCE AND
            STANDARD DEVIATION ARE USED.

STANDARD    (P 557).
DEVIATION
COEFFICIENT OF VARIATION = STANDARD DEVIATION ÷ |MEAN| . WHEN
            THE MEAN IS LESS THAN 1E-30, THE COEFFICIENT OF
            VARIATION IS SET TO ZERO.

MEAN        (P 556). THE AVERAGE OF THE SUM OF THE ABSOLUTE
DEVIATION   DIFFERENCES BETWEEN THE SAMPLE VALUES AND THE
            MEDIAN.

RANGE       MAXIMUM - MINIMUM (P 557).

MIDSPREAD   Q(3) - Q(1) , ALSO CALLED THE INTERQUARTILE
            DISTANCE.

M3          THIRD CENTRAL MOMENT. UNBIASED ESTIMATOR IS USED.
            (P 558)

M4          FOURTH CENTRAL MOMENT. UNBIASED ESTIMATOR IS USED.
            (P 558)

COEFFICIENT OF SKEWNESS     M3 ÷ (STD DEV)*3

COEFFICIENT OF KURTOSIS     ( M4 ÷ (STD DEV)*4 ) - 3

BETA1       BIASED ESTIMATE OF THIRD CENTRAL MOMENT. CAN BE
            USED IN TESTING FOR NORMALITY. (BIOMETRIKA TABLES
            FOR STATISTICIANS, 1966).

BETA2       BIASED ESTIMATE OF FOURTH CENTRAL MOMENT.
            (BIOMETRIKA TABLES FOR STATISTICIANS, 1966).

MAXIMUM     LARGEST SAMPLE VALUE.

MINIMUM     SMALLEST SAMPLE VALUE.

SAMPLE      THE a-QUANTILE, X(a), IS THE SOLUTION TO THE EQ.
QUANTILES   PROBABILITY (X < X(a)) = a .
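The quartile-based definitions above can be illustrated with a short sketch. This is not the APL source of HIST; it is a Python illustration that follows the HISTHOW wording literally. The function names are hypothetical, the bracket in J = [ Q(I) x N ] + 1 is read as the integer part, and the midmean is assumed to include values equal to the quartiles themselves.

```python
import math


def sample_quartiles(data):
    """Return the three sample quartiles as the J-th ordered values,
    with J = floor(q * N) + 1 (1-based), as worded in HISTHOW."""
    xs = sorted(data)
    n = len(xs)
    quartiles = []
    for q in (0.25, 0.50, 0.75):
        j = math.floor(q * n) + 1   # 1-based rank
        j = min(j, n)               # guard against running off the end (assumption)
        quartiles.append(xs[j - 1])
    return quartiles


def trimean(data):
    """0.25 x (Q(1) + 2 Q(2) + Q(3))."""
    q1, q2, q3 = sample_quartiles(data)
    return 0.25 * (q1 + 2 * q2 + q3)


def midmean(data):
    """Average of the sample values between the lower and upper quartiles.
    Including the end points is an assumption."""
    q1, _, q3 = sample_quartiles(data)
    inner = [x for x in data if q1 <= x <= q3]
    return sum(inner) / len(inner)


def midspread(data):
    """Q(3) - Q(1), the interquartile distance."""
    q1, _, q3 = sample_quartiles(data)
    return q3 - q1
```

For the twelve values 1 through 12 the ranks are 4, 7 and 10, so the quartiles are simply the 4th, 7th and 10th ordered values.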


With  this  complete  description  the  general  user  should 
be  able  to  take  full  advantage  of   HIST   and  put  to  use  all 
its  options. 
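As an illustration of the smoothed empirical density described in HISTHOW, the following Python sketch evaluates F(Z) at a single point. It is not the APL code of HIST. The function names are hypothetical, and the bandwidth is taken to be the sample range divided by the square root of N, which is an assumption about the intended operator in the printed relation.

```python
import math


def triangular_weight(z):
    """Weight function W: zero outside [-1, 1], 1 - |z| inside."""
    return 0.0 if abs(z) > 1 else 1.0 - abs(z)


def smoothed_density(data, z):
    """Evaluate F(z) = (1 / (N * B)) * sum of W((x_i - z) / B).

    B is assumed to be RANGE / sqrt(N). The data must not all be
    equal, otherwise the bandwidth is zero (HIST imposes the same
    requirement on its input)."""
    n = len(data)
    data_range = max(data) - min(data)
    bandwidth = data_range / math.sqrt(n)
    total = sum(triangular_weight((x - z) / bandwidth) for x in data)
    return total / (n * bandwidth)
```

With the four points 0, 1, 2, 3 the bandwidth is 1.5, and at z = 1.5 only the two middle points contribute, each with weight 2/3, giving F = (4/3) / 6 = 2/9. HIST itself requires at least 10 data points; the small sample here is only for checking the arithmetic by hand.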
B.   USAGE WITH TELEPHONE DATA 1 AND TELEPHONE
DATA 2, OFFLINE, ALL DATA, ECDF, AND TITLE

HIST was now used on two sets of data. Both telephone data 1
and telephone data 2 were first run with the offline printer,
demonstrating the title option, the empirical density function
option and the conditional option with any data points outside
the designated interval lumped into the end cells. When HIST
was typed, the following responses to each of the queries were
entered.

HIST
ENTER DATA IN VECTOR FORM
□:

TELDAT1

IF YOU ALSO WANT A SMOOTHED EMPIRICAL DENSITY FUNCTION ENTER
A 1 . IF YOU DO NOT WANT IT ENTER A 0
□:

1

IF YOU WANT TO TITLE YOUR HISTOGRAM TYPE YOUR TITLE.
IF YOU DO NOT WANT A TITLE JUST HIT THE CARRIAGE RETURN.

TELEPHONE DATA 1

IF YOU WANT TO SET THE NUMBER OF CELLS AND THE SCALE ENTER
FIRST THE NUMBER OF CELLS (AN INTEGER BETWEEN 10 AND 28)
FOLLOWED BY A SPACE AND THEN YOUR LEFT SCALE POINT FOLLOWED
BY A SPACE AND THEN YOUR RIGHT SCALE POINT. HOWEVER, IF YOU
WANT HIST TO AUTOMATICALLY SCALE ENTER 0 0 0
□:

28 0 20000

GIVEN THAT YOU HAVE SET YOUR OWN SCALE, TO INCLUDE DATA
POINTS THAT MIGHT BE OUTSIDE YOUR SCALE LIMITS IN THE END
CELLS, TYPE 1 . IF YOU DESIGNATED AUTOSCALE ALSO, TYPE
1 . IF HOWEVER, YOU DO NOT WANT THE DATA OUTSIDE THE SCALE
LIMITS INCLUDED IN THE HISTOGRAM, TYPE 0
□:

1

IF YOU WANT YOUR OUTPUT TO APPEAR ON THE OFFLINE PRINTER,
TYPE 1 . IF YOU WANT YOUR OUTPUT TO APPEAR ON YOUR
TERMINAL, TYPE 0 . (NOTE IF YOU TYPED 0 BE SURE YOUR
TERMINALS CARRIAGE PAGE SETTING IS ON THE MAXIMUM WIDTH)
□:

1
HISTOGRAM SENT TO PRINTER
Note that telephone data 1 was contained in the variable
TELDAT1 and that the number of cells chosen was 28, with the
left scale point being 0 and the right scale point being
20,000.

After the response - HISTOGRAM SENT TO PRINTER - was received,
HIST was again typed under identical conditions and telephone
data 2 was entered through the variable TELDAT2.

HIST
ENTER DATA IN VECTOR FORM
□:

TELDAT2

IF YOU ALSO WANT A SMOOTHED EMPIRICAL DENSITY FUNCTION ENTER
A 1 . IF YOU DO NOT WANT IT ENTER A 0
□:

1

IF YOU WANT TO TITLE YOUR HISTOGRAM TYPE YOUR TITLE.
IF YOU DO NOT WANT A TITLE JUST HIT THE CARRIAGE RETURN.

TELEPHONE DATA 2

IF YOU WANT TO SET THE NUMBER OF CELLS AND THE SCALE ENTER
FIRST THE NUMBER OF CELLS (AN INTEGER BETWEEN 10 AND 28)
FOLLOWED BY A SPACE AND THEN YOUR LEFT SCALE POINT FOLLOWED
BY A SPACE AND THEN YOUR RIGHT SCALE POINT. HOWEVER, IF YOU
WANT HIST TO AUTOMATICALLY SCALE ENTER 0 0 0
□:

28 0 20000

GIVEN THAT YOU HAVE SET YOUR OWN SCALE, TO INCLUDE DATA
POINTS THAT MIGHT BE OUTSIDE YOUR SCALE LIMITS IN THE END
CELLS, TYPE 1 . IF YOU DESIGNATED AUTOSCALE ALSO, TYPE
1 . IF HOWEVER, YOU DO NOT WANT THE DATA OUTSIDE THE SCALE
LIMITS INCLUDED IN THE HISTOGRAM, TYPE 0
□:

1

IF YOU WANT YOUR OUTPUT TO APPEAR ON THE OFFLINE PRINTER,
TYPE 1 . IF YOU WANT YOUR OUTPUT TO APPEAR ON YOUR
TERMINAL, TYPE 0 . (NOTE IF YOU TYPED 0 BE SURE YOUR
TERMINALS CARRIAGE PAGE SETTING IS ON THE MAXIMUM WIDTH)
□:

1
HISTOGRAM SENT TO PRINTER
By looking at figure 1 (output for telephone data 1) and
figure 2 (output for telephone data 2), the similarities and
differences in the histograms can be compared. Without getting
into specifics, the empirical density function plot seems to
indicate that both sets of data are similar. However, the
times-between-errors equal to one dominate the data; a more
detailed discussion of this data and its analysis is given in
Section VIII.
C.   USAGE WITH TELEPHONE DATA 1 AND TELEPHONE DATA 2, ON LINE,
CONDITIONAL DATA BETWEEN 2 AND 140, ECDF, AND TITLE

Because both sets of data have

(1)  a large number of elements,

(2)  a large number of times-between-errors equal to 1
(this becomes more apparent when HISTLIST is
described), and

(3)  a very extensive range,

it would appear that the conditional option available on HIST
could be used to see whether the two data sets are in fact
similar over a smaller interval. This was done using the
on-line (terminal) output option, the empirical density
function option, the title option and the conditional option
with any data points outside the designated interval excluded
from the histogram.


HIST
ENTER DATA IN VECTOR FORM
□:

TELDAT1

IF YOU ALSO WANT A SMOOTHED EMPIRICAL DENSITY FUNCTION ENTER
A 1 . IF YOU DO NOT WANT IT ENTER A 0
□:

1

IF YOU WANT TO TITLE YOUR HISTOGRAM TYPE YOUR TITLE.
IF YOU DO NOT WANT A TITLE JUST HIT THE CARRIAGE RETURN.

TELEPHONE DATA 1 BETWEEN 2 AND 140

IF YOU WANT TO SET THE NUMBER OF CELLS AND THE SCALE ENTER
FIRST THE NUMBER OF CELLS (AN INTEGER BETWEEN 10 AND 28)
FOLLOWED BY A SPACE AND THEN YOUR LEFT SCALE POINT FOLLOWED
BY A SPACE AND THEN YOUR RIGHT SCALE POINT. HOWEVER, IF YOU
WANT HIST TO AUTOMATICALLY SCALE ENTER 0 0 0
□:

23 2 140

GIVEN THAT YOU HAVE SET YOUR OWN SCALE, TO INCLUDE DATA
POINTS THAT MIGHT BE OUTSIDE YOUR SCALE LIMITS IN THE END
CELLS, TYPE 1 . IF YOU DESIGNATED AUTOSCALE ALSO, TYPE
1 . IF HOWEVER, YOU DO NOT WANT THE DATA OUTSIDE THE SCALE
LIMITS INCLUDED IN THE HISTOGRAM, TYPE 0
□:

0

IF YOU WANT YOUR OUTPUT TO APPEAR ON THE OFFLINE PRINTER,
TYPE 1 . IF YOU WANT YOUR OUTPUT TO APPEAR ON YOUR
TERMINAL, TYPE 0 . (NOTE IF YOU TYPED 0 BE SURE YOUR
TERMINALS CARRIAGE PAGE SETTING IS ON THE MAXIMUM WIDTH)
□:

0


Note that the same variable TELDAT1 is used, but this time the
interval was between 2 and 140. Also, the response - HISTOGRAM
SENT TO PRINTER - was not printed because the on-line
(terminal) option was employed.

After the output for telephone data 1 was printed, HIST was
again typed and telephone data 2 was entered under identical
conditions.

HIST
ENTER DATA IN VECTOR FORM
□:

TELDAT2

IF YOU ALSO WANT A SMOOTHED EMPIRICAL DENSITY FUNCTION ENTER
A 1 . IF YOU DO NOT WANT IT ENTER A 0
□:

1

IF YOU WANT TO TITLE YOUR HISTOGRAM TYPE YOUR TITLE.
IF YOU DO NOT WANT A TITLE JUST HIT THE CARRIAGE RETURN.

TELEPHONE DATA 2 BETWEEN 2 AND 140

IF YOU WANT TO SET THE NUMBER OF CELLS AND THE SCALE ENTER
FIRST THE NUMBER OF CELLS (AN INTEGER BETWEEN 10 AND 28)
FOLLOWED BY A SPACE AND THEN YOUR LEFT SCALE POINT FOLLOWED
BY A SPACE AND THEN YOUR RIGHT SCALE POINT. HOWEVER, IF YOU
WANT HIST TO AUTOMATICALLY SCALE ENTER 0 0 0
□:

23 2 140

GIVEN THAT YOU HAVE SET YOUR OWN SCALE, TO INCLUDE DATA
POINTS THAT MIGHT BE OUTSIDE YOUR SCALE LIMITS IN THE END
CELLS, TYPE 1 . IF YOU DESIGNATED AUTOSCALE ALSO, TYPE
1 . IF HOWEVER, YOU DO NOT WANT THE DATA OUTSIDE THE SCALE
LIMITS INCLUDED IN THE HISTOGRAM, TYPE 0
□:

0

IF YOU WANT YOUR OUTPUT TO APPEAR ON THE OFFLINE PRINTER,
TYPE 1 . IF YOU WANT YOUR OUTPUT TO APPEAR ON YOUR
TERMINAL, TYPE 0 . (NOTE IF YOU TYPED 0 BE SURE YOUR
TERMINALS CARRIAGE PAGE SETTING IS ON THE MAXIMUM WIDTH)
□:

0


Figure 3 (output from telephone data 1 between 2 and 140) and
figure 4 (output from telephone data 2 between 2 and 140) now
appear quite different in shape, judging by the empirical
density function plot. The difference was masked in figures 1
and 2, again, by the extensive range of the data (85,993 for
telephone data 1 and 67,271 for telephone data 2) and the large
number of times-between-errors equal to one. Both sets of data
are actually discrete, occurring only at multiples of 1, but as
an initial analysis the data sets were treated as continuous.
Thus, by employing the conditional option available on HIST,
differences in the two sets of data become quite apparent,
whereas before they were not so easily detected.
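The two ways of handling points outside the scale limits can be sketched as follows. This is a Python illustration, not the APL code of HIST. The function name is hypothetical, and the treatment of points that fall exactly on a cell boundary (including the right scale point) is an assumption, since the text does not describe it.

```python
def cell_counts(data, n_cells, left, right, include_outside):
    """Count data points per histogram cell on [left, right].

    include_outside=True  lumps outside points into the end cells.
    include_outside=False drops them, which gives the histogram for
    the conditional density on [left, right]."""
    width = (right - left) / n_cells
    counts = [0] * n_cells
    for x in data:
        if x < left or x > right:
            if not include_outside:
                continue                      # conditional option: exclude
            index = 0 if x < left else n_cells - 1
        else:
            # A point equal to the right scale point goes in the last cell
            # (assumption).
            index = min(int((x - left) / width), n_cells - 1)
        counts[index] += 1
    return counts
```

Dividing each count by the total of the counts returned gives relative frequencies; with the outside points excluded these are relative to the points inside the interval only, which is what makes the histogram conditional.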
FIGURE 4

III.   LISTING ROUTINE

A.   DESCRIPTION 

The second routine presented is a listing routine. APL has a
function that will automatically sort the data and print the
results. However, the unique feature of HISTLIST (listing
routine) is that it takes advantage of like occurrences in the
data and prints the data in ascending order in a compressed
form. This becomes highly useful when listing a large number of
data points that contain multiple occurrences. It is also a
tool for finding multiplicities in supposedly continuous data,
and a probability function estimating routine for data which is
known to be discrete.

A complete description of how HISTLIST operates is contained in
the variable HISTLISTHOW. When the user types HISTLISTHOW the
following response is printed on the terminal:

HISTLISTHOW 

SYNTAX      HISTLIST 

HISTLIST IS A HIGHLY CONVENIENT WAY TO LIST YOUR DATA.
HISTLIST TAKES YOUR DATA, ORDERS IT AND COMPRESSES IT. FOR
EXAMPLE, IF THREE DATA POINTS WERE ALL THE SAME VALUE
HISTLIST WOULD JUST PRINT THE VALUE ONCE AND THEN PRINT THE
NUMBER OF OCCURENCES OF THAT VALUE. HISTLIST WILL ALSO
PRINT THE SERIAL NUMBER OF THE DATA, THE PERCENTAGE THIS
SAMPLE VALUE IS TO THE WHOLE SAMPLE, AND A SMALL HISTOGRAM
(STARS) SHOWING RELATIVE PERCENTAGES. EXAMPLE: 6 4 4 3 4

HISTLIST

SER. NUM.   ORDERED DATA   NUMBER OF OCCURENCES         PER CENT

    1            3              1   ****                  .20
    2            4              3   ************          .60
    5            6              1   ****                  .20

HISTLIST IS IDEALLY SUITED FOR A LARGE SAMPLE THAT COULD
POSSIBLY HAVE A LOT OF LIKE OCCURENCES. HISTLIST FURTHER HAS
THE ADVANTAGE OF BEING USED WITH EITHER THE OFFLINE PRINTER
OR THE USERS TERMINAL.
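The compression performed by HISTLIST can be sketched in a few lines of Python. This is an illustration of the idea only, not the APL routine. The function name is hypothetical, and the serial number is taken to be the rank of the first occurrence of each value in the ordered data, which agrees with the example in HISTLISTHOW.

```python
from collections import Counter


def compressed_listing(data):
    """Return rows of (serial number, value, occurrences, fraction of sample)
    for the distinct values of the data in ascending order."""
    counts = Counter(data)
    n = len(data)
    rows = []
    serial = 1                     # rank of the first occurrence of the value
    for value in sorted(counts):
        occurrences = counts[value]
        rows.append((serial, value, occurrences, occurrences / n))
        serial += occurrences      # skip over the repeated values
    return rows
```

Applied to the example data 6 4 4 3 4, this gives the three rows of the HISTLISTHOW example: serial numbers 1, 2 and 5, with fractions .20, .60 and .20. The row of stars is not reproduced here.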


B.   USAGE WITH TELEPHONE DATA 1 AND TELEPHONE DATA 2 OFFLINE

HISTLIST was used with the title option and offline printer
option on both telephone data 1 and telephone data 2. When
HISTLIST was typed, the following responses to each of the
queries were entered.

HISTLIST
HISTLIST PRINTS THE SERIAL NUMBER OF THE COMPRESSED
DATA, THE ORDERED DATA COMPRESSED, AND THE NUMBER OF
LIKE OCCURENCES. ENTER YOUR DATA IN VECTOR FORM.
□:

TELDAT1

IF YOU WANT TO TITLE YOUR DATA TYPE YOUR TITLE.
IF YOU DO NOT WANT A TITLE JUST HIT THE CARRIAGE
RETURN.

TELEPHONE DATA 1

IF YOU WANT YOUR OUTPUT TO APPEAR ON THE OFFLINE
PRINTER TYPE 1 . IF YOU WANT YOUR OUTPUT TO APPEAR
ON YOUR TERMINAL TYPE 0
□:

1
HISTLIST SENT TO PRINTER

After the response - HISTLIST SENT TO PRINTER - was received,
HISTLIST was again typed and telephone data 2 was entered.
HISTLIST
HISTLIST PRINTS THE SERIAL NUMBER OF THE COMPRESSED
DATA, THE ORDERED DATA COMPRESSED, AND THE NUMBER OF
LIKE OCCURENCES. ENTER YOUR DATA IN VECTOR FORM.
□:

TELDAT2

IF YOU WANT TO TITLE YOUR DATA TYPE YOUR TITLE.
IF YOU DO NOT WANT A TITLE JUST HIT THE CARRIAGE
RETURN.

TELEPHONE DATA 2

IF YOU WANT YOUR OUTPUT TO APPEAR ON THE OFFLINE
PRINTER TYPE 1 . IF YOU WANT YOUR OUTPUT TO APPEAR
ON YOUR TERMINAL TYPE 0
□:

1
HISTLIST SENT TO PRINTER


Looking at figure 5 (output with telephone data 1) and figure 6
(output with telephone data 2), the listings of the two data
sets can be compared. It can be seen that both telephone data 1
and telephone data 2 contain a large number of multiple
occurrences of the number one and the number two. In fact, 19%
of telephone data 1 is the number one and 24% of telephone
data 2 is the number one. Also, telephone data 2 has many more
multiple occurrences in the 120 to 130 range than telephone
data 1. This is quickly apparent from the stars to the right of
the ordered data.
FIGURE 5A

FIGURE 5B

FIGURE 5C

FIGURE 6B

FIGURE 6D
In addition, HISTLIST saved on printing time and paper. By
printing the data in compressed form, HISTLIST saved printing
448 lines (6 additional pages) in the case of telephone data 1
and 419 lines (5 additional pages) in the case of telephone
data 2. Thus HISTLIST not only gives the user more information
than an ordered listing of the data, but is also cost effective
in terms of printing time and paper used. Finally, note that it
is not possible to look at the data in as much detail with
routine HIST as with HISTLIST. If the data is continuous and
there are no multiplicities, then HISTLIST gives only this
information and an ordered listing of the data. The shape of
the density function can best be seen (estimated) by using
routine HIST.
IV.   SECTIONING ROUTINE


A.   DESCRIPTION

The third routine presented is the sectioning routine, HISTS.
HISTS (sectioning routine) gives a way of assessing the
variability of estimates of descriptive statistics from sample
data. It is essential that the data be in random order.

The basic idea is as follows. Assume we have $m$ independent
observations $y_1, y_2, \ldots, y_m$ of a random variable $Y$.
The usual estimate of its mean value $\mu = E(Y)$ is the sample
mean $\bar{y}$, where

\[
\bar{y} = \frac{1}{m} \sum_{i=1}^{m} y_i .
\]

Now $\bar{y}$ is the least-squares estimate of $\mu$, and
therefore unbiased with variance
$\mathrm{var}(\bar{y}) = \sigma^2 / m$, where
$\sigma^2 = \mathrm{var}(Y)$. Of course $\sigma^2$ is unknown,
but we can estimate it from the data with the sample variance

\[
s^2 = \frac{1}{m-1} \sum_{i=1}^{m} (y_i - \bar{y})^2
\]

and then estimate the variance of the estimate $\bar{y}$ of
$\mu$ as

\[
\widehat{\mathrm{var}}(\bar{y}) = \frac{s^2}{m}
  = \frac{1}{m(m-1)} \sum_{i=1}^{m} (y_i - \bar{y})^2 .
\]

This is the basis for the sectioning routine: here the $y_i$
are estimates of descriptive statistics from the $m$ sections
of the data and $\bar{y}$ is the average of the statistics from
each section. Estimates are assumed independent because the
original data is assumed to be independent.

A complete description of how HISTS operates is contained in
the variable HISTSHOW. When the user types HISTSHOW the
following response is printed on the terminal:

HISTSHOW 

SYNTAX      HISTS 

HISTS ALLOWS YOU TO INTERACTIVELY SECTION YOUR DATA AND
ASSESS THE VARIABILITY IN EACH OF THE DESCRIPTIVE STATISTICS
BY USING THE SECTIONED SAMPLE DATA.

WHEN YOU TYPE HISTS YOU WILL BE ASKED TO DESIGNATE THE
NUMBER OF SECTIONS YOU DESIRE. HISTS WILL THEN TAKE
THE UNORDERED DATA AND DIVIDE THE DATA INTO THE NUMBER
OF SECTIONS YOU INDICATE DISCARDING ANY DATA POINTS LEFT
OVER. FOR EXAMPLE, IF YOU HAVE 301 DATA POINTS AND YOU
SELECT 10 SECTIONS HISTS WILL PLACE THE FIRST 30 DATA POINTS
IN THE FIRST SECTION, THE SECOND 30 DATA POINTS IN THE
SECOND SECTION AND SO ON UNTIL THE LAST DATA POINT IS
OMITTED. YOU WILL NOW HAVE 10 SECTIONS WITH 30 DATA POINTS
PER SECTION.

HISTS WOULD NOW PRINT THE FOLLOWING STATISTICS ON EACH OF
THE SECTIONS: MEAN, MEDIAN, VARIANCE, STD DEV, COEF VAR,
SKEWNESS, KURTOSIS, MINIMUM AND MAXIMUM. IN ADDITION, THE
ABOVE STATISTICS WOULD BE PRINTED FOR THE UNSECTIONED DATA
TO ALLOW FOR COMPARISONS.

FINALLY, HISTS WILL PRINT (1) THE MEAN OF THE SECTIONED
DATA STATISTICS. FOR EXAMPLE, THE MEAN FOR SKEWNESS WOULD BE
EACH SECTION VALUE FOR SKEWNESS SUMMED UP AND DIVIDED BY THE
NUMBER OF SECTIONS. (2) THE VARIANCE AND STD DEV OF THE
SECTIONED DATA STATISTICS. AND, (3) THE STD DEV DIVIDED BY
THE SQUARE ROOT OF THE NUMBER OF SECTIONS, WHICH ESTIMATES
THE STANDARD DEVIATION OF THE STATISTICS.

AS A RESULT, HISTS WILL GIVE YOU AN UNBIASED ESTIMATE OF
THE VARIANCE OF THE SAMPLE MEAN, MEDIAN, VARIANCE, STD DEV,
COEF VAR, SKEWNESS AND KURTOSIS FROM USING THE SAMPLE
VARIANCE OF THE SECTIONED DATA. WITH THIS RESULT, CONFIDENCE
INTERVALS CAN ALSO BE OBTAINED FOR EACH OF THE ABOVE
STATISTICS, IF THE ESTIMATES FROM THE SECTIONS ARE NORMALLY
DISTRIBUTED. HISTS IS BEST SUITED FOR LARGE AND MODERATE SIZED
SAMPLES; FOR SMALL SAMPLES JACKNIFING SHOULD BE CONSIDERED.
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B.   USAGE  WITH  TELEPHONE  DATA  1 

HISTS was next used on telephone data 1 to assess the
variability in the mean, median, variance, standard deviation,
coefficient of variation, skewness and kurtosis. When
HISTS was typed the following responses were entered (see
figure 7).

The  672  data  points  of  telephone  data  1  were  broken  down 
into  16  sections  with  42  data  points  per  section.   Because 
of  this  breakdown  no  data  points  were  discarded. 

The unsectioned statistics printed can be compared with
the values printed by HIST (figure 1) and are in fact the
same. Provided that the estimates are normally distributed
(this can be checked with the normal plots, described later),
confidence intervals for each of the statistics (mean, median,
variance, standard deviation, coefficient of variation,
skewness and kurtosis) based on the t-statistic can be obtained
in the following manner:

$$\bar{y}_n \pm \frac{s_n}{\sqrt{m}} \, t_{(1-\frac{1}{2}\alpha),(m-1)}$$

Here $\bar{y}_n$ is the mean of the sectioned data statistics
(obtained from column one under summary for sectioned data);
$s_n/\sqrt{m}$ is the standard deviation of the sectioned data
statistic divided by the square root of the number of sections
(obtained from column four under summary for sectioned data);
$m$ is the number of sections chosen; and
$t_{(1-\frac{1}{2}\alpha),(m-1)}$ is the $1-\frac{1}{2}\alpha$
quantile of the t-distribution with $m-1$ degrees of freedom.

FIGURE 7

C.   INTERPRETATION OF RESULTS

As an example, a confidence interval for the coefficient
of variation was obtained in the following manner. The mean
value of the coefficient of variation for the 16 sections is
4.1175 (column 1). The standard deviation divided by the
square root of 16 is .31128 (column 4). Using $\alpha = .05$,
the t value with 15 degrees of freedom is 2.131. Thus, the
95% confidence interval for the coefficient of variation for
telephone data 1 is 4.1175 ± (.31128)(2.131), which is
[3.454, 4.781]. Confidence intervals on the six other
statistics could be obtained in the same fashion.
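
The arithmetic of this interval can be checked with a few lines of Python. The helper name `t_interval` is hypothetical; the three numbers are the ones quoted above, with the t value taken from tables rather than computed.

```python
def t_interval(estimate, std_err, t_value):
    """Return (lower, upper) limits: estimate +/- std_err * t_value."""
    half_width = std_err * t_value
    return estimate - half_width, estimate + half_width

# Coefficient of variation, 16 sections, t(.975, 15 d.f.) = 2.131
low, high = t_interval(4.1175, 0.31128, 2.131)
print(round(low, 3), round(high, 3))   # 3.454 4.781
```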

Again note that the use of the variance estimate from
the sectioned data to give confidence intervals is based on
the assumption that the estimates from the sections are
independent and normally distributed. The normality will
depend on the number of observations in each section, which
should be kept large to induce normality. This requirement
conflicts with the need to make the number of sections large
to reduce the variability in the estimate of the variance of
the statistics.

Another problem is that if the number of observations in
each section is small, the estimates may be severely biased.
This effect can be seen in figure 7: note that all of the 16
estimates of skewness from the sections are smaller than the
estimate 7.1531 from the unsectioned data.

V.   JACKNIFE ROUTINE

A.   DESCRIPTION 

The  fourth  routine  presented  is  the  jacknife  routine. 
HISTJACK   (jacknife  routine)  is  another  way  of  assessing  the 
variability  in  the  estimates  from  sample  data,  and  also  of 
reducing  bias  in  estimates  of  the  descriptive  statistics. 

The jacknife procedure, like the previous sectioning
method, is based on the assumption that an independent and
identically distributed random sample $x_1, x_2, \ldots, x_n$ has
come from a population with an unknown distribution function
$F_X(x)$. If we divide the sample into $r$ groups, with each
group containing the same number of elements, we can obtain
estimates $\hat{\theta}$ of the descriptive statistics, which we
denote generically as $\theta$, in the same manner as previously
done with the sectioning method. The difference here is that the
descriptive statistics are computed with the $j$th group
deleted, $j = 1, 2, \ldots, r$. We then let $\hat{\theta}_{(j)}$ be
the descriptive statistic estimate computed with the $j$th
subgroup omitted, and $\hat{\theta}_{all}$ the corresponding
descriptive statistic estimated from the entire sample (no
group omitted). The jacknife pseudo-values are then computed
in the following way:

$$\theta^*_j = r\,\hat{\theta}_{all} - (r-1)\,\hat{\theta}_{(j)}, \qquad j = 1, 2, \ldots, r$$

Then we define the jacknifed estimator to be:

$$\theta^* = \frac{1}{r} \sum_{j=1}^{r} \theta^*_j$$


The pseudo-values can be used to obtain variance estimates
for $\theta^*$, and to set approximate confidence limits, using
Student's t. The idea is that the pseudo-values will be
approximately independent and possibly normally distributed.

The jacknifed estimator $\theta^*$ is a sample average, so we
form an estimate $s_*^2$ of its variance given by the following
relationship (Miller, 1974):

$$s^2 = \frac{\sum_{j=1}^{r} \theta_j^{*2} - \frac{1}{r}\left(\sum_{j=1}^{r} \theta^*_j\right)^2}{r-1}, \qquad s_*^2 = s^2 / r$$


This procedure is particularly useful if the number $n$ of
data points is small, but it must be used with care. Note
that the estimator $\theta^*$ is designed to eliminate a $1/n$
bias term in the estimator $\hat{\theta}$.
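
A minimal Python sketch of the pseudo-value computation follows. It is an illustration of the formulas above rather than the APL routine; the function name `jackknife` and the small data vector are assumptions made for the example.

```python
import math
import statistics

def jackknife(data, r, stat=statistics.mean):
    """Grouped jackknife with r groups; leftover points are discarded."""
    size = len(data) // r
    data = list(data[:size * r])
    theta_all = stat(data)                         # estimate from all the data
    pseudo = []
    for j in range(r):
        kept = data[:j * size] + data[(j + 1) * size:]   # j-th group deleted
        pseudo.append(r * theta_all - (r - 1) * stat(kept))
    theta_star = sum(pseudo) / r                   # jackknifed estimator
    s2 = (sum(p * p for p in pseudo) - sum(pseudo) ** 2 / r) / (r - 1)
    return theta_star, s2, math.sqrt(s2 / r)       # last value: est. sd of theta_star

# Complete jackknife (one point per group) of the mean
theta_star, s2, s_star = jackknife([2.0, 4.0, 4.0, 5.0, 7.0, 8.0], 6)
print(theta_star, s2, s_star)
```

For the mean, the complete jackknife returns the pseudo-values equal to the data points themselves, so the jackknifed estimate coincides with the ordinary sample mean.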

A complete description of how HISTJACK operates is
contained in the variable HISTJACKHOW. When the user types
HISTJACKHOW the following response is printed on the
terminal.

HISTJACKHOW

SYNTAX   HISTJACK

HISTJACK ALLOWS YOU TO INTERACTIVELY JACKNIFE YOUR DATA
AND ASSESS THE VARIABILITY IN EACH OF THE STATISTICAL
ESTIMATES BY USING THE SAMPLE DATA.

WHEN YOU TYPE HISTJACK YOU WILL BE ASKED TO DESIGNATE THE
NUMBER OF GROUPS YOU DESIRE. HISTJACK WILL TAKE THE
UNORDERED DATA AND DIVIDE THE DATA INTO THE NUMBER OF
GROUPS YOU INDICATE DISCARDING ANY DATA POINTS LEFT OVER.
FOR EXAMPLE, IF YOU HAVE 22 DATA POINTS AND YOU SELECT 7
GROUPS HISTJACK WILL PLACE THE FIRST 3 DATA POINTS IN GROUP
1, THE SECOND 3 DATA POINTS IN GROUP 2, AND SO ON UNTIL THE
LAST DATA POINT IS OMITTED. YOU WOULD NOW HAVE 7 GROUPS
WITH 3 DATA POINTS PER GROUP. IF YOU HAD ELECTED TO DO A
COMPLETE JACKNIFE, THAT IS TYPED 22, YOU WOULD NOW HAVE 22
GROUPS WITH 1 DATA POINT OMITTED PER GROUP.

HISTJACK WOULD NOW PERFORM STATISTICAL COMPUTATIONS USING
THE JACKNIFE PROCEDURE. THAT IS, BY OMITTING ONE GROUP AT A
TIME, STARTING WITH THE FIRST GROUP, HISTJACK WOULD PRINT
THE FOLLOWING STATISTICS: MEAN, MEDIAN, VARIANCE, STD DEV,
COEF VAR, SKEWNESS, KURTOSIS, MINIMUM AND MAXIMUM. IN
ADDITION, THE ABOVE STATISTICS WOULD BE PRINTED FOR THE
UNGROUPED DATA TO ALLOW FOR COMPARISONS. (NOTE, THE COLUMNS
GIVE THE STATISTIC ESTIMATED FROM ALL THE DATA WITH ONE
GROUP MISSING, AND NOT THE PSEUDO-VALUES)

FINALLY, HISTJACK WILL PRINT (1) THE JACKNIFE ESTIMATE
(2) THE SAMPLE VARIANCE OF THE PSEUDO-VALUES DERIVED IN THE
JACKNIFE ESTIMATE (3) AND, THE ESTIMATED STD DEV OF THE
JACKNIFE ESTIMATE DIVIDED BY THE SQUARE ROOT OF THE NUMBER
OF GROUPS.

AS A RESULT, HISTJACK WILL GIVE YOU AN ESTIMATE OF THE
VARIANCE OF THE SAMPLE MEAN, MEDIAN, VARIANCE, STD DEV, COEF
VAR, SKEWNESS AND KURTOSIS USING THE SAMPLE VARIANCE OF THE
JACKNIFED DATA. WITH THIS RESULT, CONFIDENCE INTERVALS CAN
BE OBTAINED FOR EACH OF THE ABOVE STATISTICS, AGAIN ASSUMING
THAT THE PSEUDO-VALUES ARE APPROXIMATELY INDEPENDENT AND
NORMALLY DISTRIBUTED. HISTJACK IS BEST SUITED FOR SMALL
SAMPLES.

B.   USAGE WITH TELEPHONE DATA 1

HISTJACK was next used on telephone data 1 to assess the
variability in the mean, median, variance, standard deviation,
coefficient of variation, skewness and kurtosis. When
HISTJACK was typed the following responses were entered
(see figure 8).

The 672 data points were broken down into 16 groups with
42 data points per group. Again, because of this breakdown
no data points were discarded.

The ungrouped statistics printed are again the same
values that were printed by HIST (figure 1). Using the
jacknife method, confidence intervals for each of the
statistics (mean, median, variance, standard deviation,
coefficient of variation, skewness and kurtosis) can be
obtained in the following manner:

$$\theta^* \pm s_* \, t_{(1-\frac{1}{2}\alpha),(r-1)}$$

Here $\theta^*$ is the jacknife estimate from the sample data
(obtained from column one under summary for jacknifed data);
$s_*$ is the jacknife estimate of the standard deviation
divided by the square root of the number of groups (obtained
from column four under summary for jacknifed data); $r$ is
the number of groups chosen; and $t_{(1-\frac{1}{2}\alpha),(r-1)}$
is the $1-\frac{1}{2}\alpha$ quantile of the t-distribution with
$r-1$ degrees of freedom. The basis for these assertions about
the confidence intervals using the jacknifing technique is
asymptotic, and great care must be taken in using them.

FIGURE 8

C.   INTERPRETATION OF RESULTS

To compare the confidence interval obtained for the
coefficient of variation using the sectioning routine with
that obtained using the jacknife routine the following was
done. The jacknife estimate of the coefficient of variation
for the 16 groups is 4.5053 (column 1). The jacknife estimate
of the standard deviation divided by the square root of
16 is .3894. Using $\alpha = .05$, the t value with 15 degrees
of freedom is 2.131. Thus, the 95% confidence interval for
the coefficient of variation for telephone data 1 is 4.5053
± (.3894)(2.131), which is [3.676, 5.335]. This compares
with the confidence interval of [3.454, 4.781] using the
sectioning routine described in section IV. Likewise,
confidence intervals on the remaining six statistics could be
obtained in a similar manner. Note that the values obtained
for the skewness coefficient from the sections are now not
evidently biased; of the 16 values, 7 have values below the
value 7.1531 for all the data.

D.  USAGE  WITH  COST  OVERRUN  DATA 

To demonstrate how the complete jacknife could be used
and why it is better to use it when possible, the following was
done. The 22 data points of the cost overrun data were used
with the jacknife routine (HISTJACK). When HISTJACK was
typed the data was entered in the variable YROVR and 22 was
typed as the number of groups. By typing 22, which is the
same as the number of data points, a complete jacknife was
done.

Looking at the output from the complete jacknife (figure
9), the cost overrun data can be studied. One can note that
by using the complete jacknife the mean, median, and variance
of the jacknife estimate (column one under summary for
jacknifed data) are the same values as the ungrouped mean,
median and variance. But also note that the coefficient of
variation is less than zero, which can happen when using the
jacknife technique.

VI.   EXPONENTIAL PLOTTING ROUTINE

A.   DESCRIPTION 

The  fifth  routine  presented  is  an  exponential  plotting 
routine.   Routine   EXPONP   is  a  way  of  plotting  the  data  to 
see  if  it  "fits"  an  exponential  distribution,  and  also  to 
give  some  indication  of  what  alternative  distributions  could 
be  used  if  the  exponential  hypothesis  is  rejected. 

A complete description of how EXPONP operates is contained
in the variable EXPONPHOW. When the user types
EXPONPHOW the following response is printed on the terminal:

EXPONPHOW

SYNTAX   EXPONP

EXPONP ORDERS THE DATA X(I) AND COMPUTES
THE EMPIRICAL LOG SURVIVER FUNCTION FOR THE DATA.
THAT IS,

$$X_{(I)} \quad \text{VS} \quad \ln\left(1 - \frac{I}{N+1}\right)$$

THE ORDERED DATA IS PLOTTED AGAINST THE LOG SURVIVER
FUNCTION TO SEE IF THERE IS A LINEAR FIT.
EXPONP ALSO ALLOWS YOU TO TITLE YOUR PLOT.
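
The plotting coordinates described in this response can be sketched in Python as follows. This only illustrates the quantity being plotted, assuming the natural logarithm; the function name `log_survivor_points` is hypothetical and no plotting is attempted.

```python
import math

def log_survivor_points(data):
    """Pair each ordered value x_(i) with ln(1 - i/(n+1)), i = 1..n."""
    x = sorted(data)
    n = len(x)
    return [(x[i - 1], math.log(1 - i / (n + 1))) for i in range(1, n + 1)]

points = log_survivor_points([3.0, 1.0, 2.0])
for value, log_surv in points:
    print(value, log_surv)
```

For exponentially distributed data these points should fall close to a straight line through the origin, which is the visual check the routine offers.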


B.   USAGE  WITH  TELEPHONE  DATA  1 

EXPONP was used with telephone data 1 to see if the
data plotted as a relatively straight line. When EXPONP was
typed the following responses were entered.

EXPONP
EXPONP ORDERS THE DATA YOU GIVE AND COMPUTES THE
EMPIRICAL LOG SURVIVER FUNCTION FOR THE DATA.
A PLOT OF THE LOG SURVIVER FUNCTION FOR THE DATA
IS THEN PRINTED TO SEE IF THERE IS A LINEAR FIT.

IF YOU WANT TO TITLE YOUR PLOT TYPE YOUR TITLE.
IF YOU DO NOT WANT A TITLE JUST HIT THE CARRIAGE
RETURN.

TELEPHONE DATA 1

ENTER YOUR DATA IN VECTOR FORM
TELDAT1


Looking at figure 10 (plot of telephone data 1 using
EXPONP), it was found that the data did not plot linearly from
the origin, but that the data did appear somewhat linear in
the tail (5,000 to 90,000 range).

C.   USAGE WITH RANDOMLY GENERATED EXPONENTIALLY DISTRIBUTED
SAMPLE WITH MEAN SAME AS TELEPHONE DATA 1

As a comparison, EXPONP was used with a randomly generated
exponential sample with the same mean as telephone data
1 (figure 11). As expected, this plot is, within limits of
sample fluctuations, linear from the origin and is in fact what
telephone data 1 would have looked like if the data were truly
exponential. The quantization because of the coarseness of
the APL type-ball is evident in this plot. The sample size
is 672, but not all these points can be plotted separately.

FIGURE 10

FIGURE 11

VII.   NORMAL PLOTTING ROUTINE

A.   DESCRIPTION 

The  final  routine  presented  is  a  normal  plotting  routine. 
Routine   NORMP   is  a  way  of  plotting  the  data  to  see  if  it 
"fits"  a  normal  distribution.   In  particular  one  might  want 
to  look  at  estimates  of  descriptive  statistics  obtained  from 
sections  and  groups  in  routines   HISTS   and   HISTJACK  . 

A complete description of how NORMP operates is contained
in the variable NORMPHOW. When the user types NORMPHOW
the following response is printed on the terminal.


NORMPHOW


SYNTAX   NORMP


NORMP ORDERS THE DATA X(I) AND COMPUTES THE
INVERSE OF THE UNIT NORMAL CUMULATIVE DISTRIBUTION.
THAT IS,

$$X_{(I)} \quad \text{VS} \quad \Phi^{-1}\left(\frac{I}{N+1}\right)$$

THE ORDERED DATA IS PLOTTED AGAINST THE INVERSE OF
THE UNIT NORMAL CUMULATIVE DISTRIBUTION TO SEE
IF THERE IS A LINEAR FIT. NORMP ALSO ALLOWS YOU
TO CONVIENTLY TITLE YOUR PLOT.

B.   USAGE WITH COST OVERRUN DATA

NORMP was used with the cost overrun data to see if
the data plotted as a relatively straight line. When NORMP
was typed the following responses were entered.

NORMP
NORMP ORDERS THE DATA YOU GIVE AND COMPUTES THE
INVERSE OF THE UNIT NORMAL CUMULATIVE DISTRIBUTION
FOR THE DATA. A PLOT OF THE INVERSE OF THE
UNIT NORMAL CUMULATIVE DISTRIBUTION VS THE ORDERED
DATA IS THEN PRINTED TO SEE IF THERE IS A
LINEAR FIT.

IF YOU WANT TO TITLE YOUR PLOT TYPE YOUR TITLE.
IF YOU DO NOT WANT A TITLE JUST HIT THE CARRIAGE
RETURN.

COST OVERRUNS

ENTER YOUR DATA IN VECTOR FORM

□:

YROVR


Note that the cost overrun data was contained in the
variable YROVR. Looking at figure 12 (plot of cost overrun
data using NORMP), it was found that the data did in
fact plot fairly linearly through the range -14 to 26
(formal tests are available; see Wilk & Gnanadesikan, 1968).
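
The normal plotting positions that NORMP describes can be sketched in Python with the standard library's `statistics.NormalDist`. The function name `normal_plot_points` and the three-point sample are assumptions for illustration only.

```python
from statistics import NormalDist

def normal_plot_points(data):
    """Pair each ordered value x_(i) with the unit normal quantile at i/(n+1)."""
    x = sorted(data)
    n = len(x)
    unit_normal = NormalDist()                     # mean 0, standard deviation 1
    return [(x[i - 1], unit_normal.inv_cdf(i / (n + 1)))
            for i in range(1, n + 1)]

points = normal_plot_points([5.0, 1.0, 3.0])
for value, quantile in points:
    print(value, quantile)
```

If the data are normal, the ordered values plotted against these quantiles should lie near a straight line whose slope reflects the standard deviation.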

C.   USAGE  WITH  NORMAL  SAMPLE  GENERATED  WITH  MEAN  AND 
VARIANCE  THE  SAME  AS  COST  OVERRUN  DATA 

As a comparison, NORMP was used with a normal sample
with the same mean and variance as the cost overrun data
(figure 13). As expected, this plot is very linear. But
again, this plot is not that much different from that of
figure 12, which lends credence to the possibility that the
cost overrun data are in fact normally distributed.

FIGURE 12

FIGURE 13

D.   USAGE WITH COEFFICIENT OF VARIATION DATA OBTAINED
FROM USING SECTIONING ROUTINE

In order to check for normality in the sectioned
estimates obtained from using HISTS (sectioning routine) the
following was done. The 16 coefficient of variation
values obtained from using HISTS with telephone data 1
(column 5, figure 7) were entered as a vector into NORMP.
Figure 14 shows that the plot is marginally linear. This
demonstrates the need for formal tests to verify normality
in the absence of a strictly linear plot (Wilk & Gnanadesikan,
1968).

FIGURE 14

VIII.   THE INDEPENDENCE AND MARKOV CHAIN
HYPOTHESES FOR THE TELEPHONE DATA

The telephone data used in the thesis (Lewis & Cox, 1966)
actually consist of binary bits transmitted over telephone
lines and the information that the bit transmitted at time i,
i = 0,1,2,..., is in error or not. This information is
characterized by a sequence of binary-valued random variables
x(i), i = 0,1,..., where x(i)=1 means that the bit
transmitted at time i is in error, while x(i)=0 means that the
bit transmitted at time i is correctly transmitted.

In telephone data 1 there are 672 ones and 1,105,476
zeros, and a much more compact and equivalent representation
of the data is obtained via the sequence of random variables
y(j), j=1,2,..., where y(j) is one plus the number of
correctly transmitted bits between the (j-1)st and jth bit
errors, with the convention that y(j)=1 if the errors occur on
adjacent transmitted bits, and y(1) is the time from i=0 to the
first incorrectly transmitted bit. The y(j) are called the
times-between-errors.
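
As an illustration of this representation, the following Python sketch converts a short binary error sequence into times-between-errors. The function name `times_between_errors` and the example bits are hypothetical; they are not taken from telephone data 1.

```python
def times_between_errors(bits):
    """Return y(1), y(2), ...: y(1) is the time from i=0 to the first error,
    later y(j) are one plus the number of correct bits between errors."""
    error_times = [i for i, bit in enumerate(bits) if bit == 1]
    gaps = []
    previous = 0                       # the time origin i = 0
    for t in error_times:
        gaps.append(t - previous)
        previous = t
    return gaps

y = times_between_errors([0, 0, 1, 1, 0, 0, 0, 1])
print(y)   # [2, 1, 4]
```

Errors on adjacent bits give a value of 1, as in the second entry of the example.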

A null hypothesis for the error structure which could be
examined is that errors occur independently at each bit with
a fixed probability, i.e.

$$P\{x(i) = 1\} = \pi(1), \qquad i = 0, 1, \ldots$$

$$P\{x(i) = 0\} = \pi(0) = 1 - \pi(1), \qquad i = 0, 1, \ldots$$

The y(j)'s then are independent and geometrically
distributed, since

$$P\{y(j) = 1\} = P\{\text{if } (j-1)\text{st error at time } i;\ j\text{th at time } i+1\} = \pi(1)$$

$$P\{y(j) = 2\} = P\{\text{if } (j-1)\text{st error at time } i;\ j\text{th at time } i+2\} = \pi(1)[1-\pi(1)] = \pi(1)\pi(0)$$

$$P\{y(j) = k+1\} = P\{\text{if } (j-1)\text{st error at time } i;\ j\text{th at time } i+1+k\} = \pi(1)[1-\pi(1)]^k = \pi(1)[\pi(0)]^k$$

Note that, using the geometric series summation formula,

$$\sum_{k=1}^{\infty} P\{y(j) = k\} = \pi(1) \, \frac{1}{1 - \pi(0)} = 1$$

$$E[y(j)] = \sum_{k=1}^{\infty} k \, P\{y(j) = k\} = \frac{\pi(1)}{[1 - \pi(0)]^2} = \frac{1}{\pi(1)}$$
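
These two sums can be checked numerically with a short Python sketch. The value of `pi1` below is an arbitrary illustrative choice, not an estimate from the data, and the series are truncated at a point where the remaining terms are negligible.

```python
def geometric_pmf(k, pi1):
    """P{y(j) = k} for k = 1, 2, ... with per-bit error probability pi1."""
    return pi1 * (1 - pi1) ** (k - 1)

pi1 = 0.25                                   # illustrative value only
total = sum(geometric_pmf(k, pi1) for k in range(1, 2001))
mean = sum(k * geometric_pmf(k, pi1) for k in range(1, 2001))
print(total, mean, 1 / pi1)
```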


Now assume that the Markov structure of the zeros and
ones is described by the transition matrix

$$P = \begin{pmatrix} P(0,0) & P(0,1) \\ P(1,0) & P(1,1) \end{pmatrix} = \begin{pmatrix} \rho + (1-\rho)\pi(0) & (1-\rho)\pi(1) \\ (1-\rho)\pi(0) & \rho + (1-\rho)\pi(1) \end{pmatrix}$$

Here $P(m,n) = P\{x(i+1) = n \mid x(i) = m\}$, and we have
parameterized the chain in terms of the stationary probability
of a one or zero, and a correlation parameter $0 \le \rho < 1$.
Note that there are only two degrees of freedom in the
stochastic matrix, since rows must sum to 1, and there is only
one degree of freedom if the stationary probability
$\pi(0) = 1 - \pi(1)$ is fixed. Note that the stationary
probabilities in the 2-state case are given by

$$\pi(0) = \frac{1 - P(1,1)}{2 - P(0,0) - P(1,1)}$$

$$\pi(1) = \frac{1 - P(0,0)}{2 - P(0,0) - P(1,1)}$$

We now define the runs of ones or zeros, i.e. for $\ell = 0$ or
$\ell = 1$, let

$$T_\ell = \inf\{n \ge 1 : x(i+n) \ne \ell\} - 1$$

the length of a run of $\ell$'s, starting after time i, where the
length can be 0,1,2,....

For example, if x(i+1)=1, then the length of the run of
zeros starting after time i is zero, and the run of
ones is at least one long. Note that it is possible to talk
of a conditional runs structure, i.e. the length of a run of
ones which is given to start after time i. The run length
is then at least one long.

Now the probability of a run $T_\ell$ having length at least
$k$ is, using the Markov property,

$$P\{T_\ell \ge k\} = P\{x(i+1) = x(i+2) = \cdots = x(i+k) = \ell\} = \pi(\ell)[P(\ell,\ell)]^{k-1}, \qquad k = 1, 2, \ldots$$

and

$$P\{T_\ell = 0\} = 1 - \pi(\ell).$$

Thus, the run lengths are geometrically distributed and

$$E[T_\ell] = \sum_{k=1}^{\infty} P\{T_\ell \ge k\} = \frac{\pi(\ell)}{1 - P(\ell,\ell)} = \frac{\pi(\ell)}{(1-\rho)[1-\pi(\ell)]}$$

Note that $\rho = 0$ gives the independence case, and while
the runs of ones or zeros are geometrically distributed for
both the independence and the Markov dependent model, the mean
run length is always longer for the Markov dependence, since

$$\frac{\pi(\ell)}{(1-\rho)[1-\pi(\ell)]} > \frac{\pi(\ell)}{1-\pi(\ell)}, \qquad 0 < \rho < 1$$

Thus, we could use the distributional properties of the
runs to (1) check that either hypothesis is tenable and (2)
if so, compare the estimated run lengths with the mean length
$\pi(\ell)/[1-\pi(\ell)]$ predicted by the independence assumption.
If the run lengths are not geometric, then another model must be
postulated.
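
A brief Python sketch of the mean run length comparison follows. It sums the tail probabilities P{T >= k} directly and compares the result with the closed form; the function name and the parameter values are illustrative assumptions, with `rho` denoting the correlation parameter of the Markov model.

```python
def mean_run_length(pi_l, rho, terms=5000):
    """Return (series sum, closed form) for the mean run length of l's."""
    p_stay = rho + (1 - rho) * pi_l                # P(l, l)
    series = sum(pi_l * p_stay ** (k - 1) for k in range(1, terms + 1))
    closed = pi_l / ((1 - rho) * (1 - pi_l))
    return series, closed

markov_series, markov_closed = mean_run_length(0.2, 0.5)
indep_series, indep_closed = mean_run_length(0.2, 0.0)
print(markov_closed, indep_closed)   # Markov mean exceeds the independence mean
```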

Note  that  when  this  mean  time-between-errors  is  large  as 
it  is  for  telephone  data  1  (figure  1;  E[y(j)]=  1,548)  the 
discreteness  of  the  time  scale  can  be  ignored  and  the  geometric 
distribution  is  indistinguishable  from  its  continuous  time 
analog,  the  exponential  distribution. 

That the approximation of the geometric distribution by
an exponential distribution is valid can be seen from the
fact that there are 672 errors (x(i)'s equal to one) in
1,106,148 transmitted bits, so that an estimate of $\pi(1)$,
which is the maximum likelihood estimate under the independence
hypothesis, is

$$\hat{\pi}(1) = \frac{\#\ x(i)\text{'s} = 1}{\text{total } \#\text{ bits transmitted}} = \frac{\#\ x(i)\text{'s} = 1}{\#\ x(i)\text{'s} = 1 \; + \; \#\ x(i)\text{'s} = 0}$$

In the present data

$$\hat{\pi}(1) = \frac{672}{1{,}106{,}148} = .0006075$$
Now this geometric hypothesis will be examined, but it
is clear from figure 1 that the hypothesis is not true. The
distribution is in fact highly skewed and has been examined
by Lewis & Cox (1966).

An alternative model to independent bit errors is that
the dependence structure is Markovian. One could examine
this hypothesis with time-series methods, but a method which
is adaptable for use with the histogram routine and which
examines both the independence and Markov assumptions is to
look at runs of ones and zeros in the x(i). Under both
hypotheses these runs have geometrically distributed lengths.

The alternating conditional runs of ones for telephone
data 1 are shown in figure 15 and the runs of zeros are shown
in figure 16. Also, HISTLIST was used on the conditional
runs; figure 17 shows the runs of ones and figure 18 shows
the runs of zeros.

To test the hypothesis that the runs of ones in telephone
data 1 are geometrically distributed the following was done.
Using figures 15 and 17 the following data were obtained:

MEAN     = 1.235294      # of runs = 1  =  444
VARIANCE =  .346008      # of runs = 2  =   81
                         # of runs = 3  =   15
                         # of runs = 4  =    1
                         # of runs = 5  =    2
                         # of runs ≥ 6  =    1

FIGURE 15

FIGURE 16


FIGURE 17

FIGURE 18A

FIGURE 18C

If the runs of ones are geometric then
$\text{prob}\{X = k\} = (1-p)p^{k-1}$, $k = 1, 2, \ldots$. Thus,
this is the "geometric plus one" distribution, with

$$\mu = E[X] = \frac{1}{1-p}$$

$$\sigma^2 = \text{VAR}[X] = \frac{p}{(1-p)^2}$$

$$c(X) = \frac{\sqrt{\text{VAR}[X]}}{E[X]} = p^{1/2}$$

To find $p$ set $E[X] = 1.235294 = 1/(1-p)$, which gives

$$p = .1904761$$

Therefore, if the data is "geometric plus one" then

$$\text{EXPECTED VAR}[X] = .1904761/(.8095239)^2 = .2906572$$

Thus, the expected variance is .2906572 and the observed
variance from HIST is .3460080. Also, the expected coefficient
of variation is

$$\text{EXPECTED } C(X) = (.1904761)^{1/2} = .4364356$$

and the observed coefficient of variation is .4761817.

Therefore, at this point there seems to be fairly close
agreement between the runs of ones and a "geometric plus one"
distribution with p = .1904761.
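
The moment calculations above can be reproduced with a short Python sketch. The function name `geometric_plus_one_moments` is hypothetical; the only input is the observed mean run length quoted in the text.

```python
import math

def geometric_plus_one_moments(mean):
    """Solve E[X] = 1/(1-p) for p, then return p, VAR[X] and c(X)."""
    p = 1 - 1 / mean
    variance = p / (1 - p) ** 2
    return p, variance, math.sqrt(p)

p, expected_var, expected_cv = geometric_plus_one_moments(1.235294)
print(round(p, 7), round(expected_var, 5), round(expected_cv, 5))
```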

As a further check a chi-square test for goodness of fit
was run on the runs, using the formula

$$\text{prob}\{X = x\} = (1-p)p^{x-1} \qquad \text{for } x = 1, 2, 3, 4, 5, \ldots$$

            PROBABILITY     EXPECTED     OBSERVED

P(X=1)  =    .8095239                      444
P(X=2)  =    .1541949                       81
P(X=3)  =    .0293704   \                   15  \
P(X=4)  =    .0055943    |   19.74           1   |   19
P(X=5)  =    .0010655    |                   2   |
P(X≥6)  =    .0002510   /                    1  /

Note, to use chi-square not more than 20% of the cells
should have expected frequencies less than 5 and no cell
should have an expected frequency less than one. Therefore,
the above frequencies must be combined into 3 cells.

$$\chi^2 = \sum_{i=1}^{3} \frac{(obs_i - ex_i)^2}{ex_i} = .1562799$$

And $\chi^2_{.05,2} = 5.99$. Thus, the null hypothesis that the
runs of ones are "geometric plus one" with p = .1904761
cannot be rejected.
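
The pooled three-cell statistic can be recomputed in Python as a check. This sketch assumes the expected counts are the total number of runs (544, the sum of the observed column) times the cell probabilities; the function name `chi_square_statistic` is hypothetical.

```python
def chi_square_statistic(observed, probabilities):
    """Return the Pearson chi-square statistic and the expected counts."""
    n = sum(observed)
    expected = [n * q for q in probabilities]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, expected

p = 0.1904761
probs = [1 - p, (1 - p) * p]               # P(X=1), P(X=2)
probs.append(1 - sum(probs))               # P(X>=3): the pooled tail cell
stat, expected = chi_square_statistic([444, 81, 19], probs)
print(round(stat, 4), [round(e, 2) for e in expected])
```

The pooled expected count comes out close to the 19.74 shown in the table, and the statistic is far below the critical value quoted above.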

A  similar  procedure  was  done  with  the  runs  of  greater 

than  one.   By  using  figure  15  the  following  information  can 

be  obtained: 

MEAN      =  1911  .27 
VARIANCE  =  59,064,970 
C0EF.VAR.=  4.021082 

And, by using the same method as before and solving for p, one gets p = .9994767.

$$\text{EXPECTED VAR}[X] = .9994767/(.0005233)^2 = 3,651,213$$

This expected variance differs greatly from the observed variance. Also, the expected coefficient of variation is computed to be

$$\text{EXPECTED } C(X) = (.9994767)^{1/2} = .9997383$$

This compares with the observed coefficient of variation of 4.021082. Because both the variance and the coefficient of variation depart grossly from the values expected under the geometric hypothesis, one can conclude that the runs of length greater than 1 are not geometrically distributed.
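
The size of this departure can be illustrated with a short calculation. The sketch below is in Python rather than APL and is not one of the library routines; it assumes that p was obtained by equating the observed mean to 1/(1-p), which reproduces the quoted p = .9994767 to the digits shown, and the variable names are illustrative only.

```python
import math

# Observed summary statistics quoted in the text for the runs of greater than one.
observed_mean = 1911.27
observed_variance = 59_064_970.0
observed_cv = math.sqrt(observed_variance) / observed_mean

# Assumed method of moments: mean = 1 / (1 - p), so p = 1 - 1 / mean.
p = 1.0 - 1.0 / observed_mean

# Moments implied by the "geometric plus one" hypothesis.
expected_variance = p / (1.0 - p) ** 2
expected_cv = math.sqrt(p)

variance_ratio = observed_variance / expected_variance
print(f"p                 = {p:.7f}")
print(f"expected variance = {expected_variance:,.0f}")
print(f"expected C(X)     = {expected_cv:.7f}")
print(f"observed C(X)     = {observed_cv:.6f}")
print(f"observed / expected variance = {variance_ratio:.1f}")
```

Under this assumption the expected variance is roughly 3.65 million, so the observed variance is about sixteen times larger, and the observed coefficient of variation is about four times the expected one.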
IX.   DOCUMENTATION  ON  ROUTINES

A.  LOCATION  IN   APL   LIBRARY 

The descriptions and routines that have been presented are all available in the APL workspace library 2 DATALFNS. Provided the user is properly logged on the terminal and in the APL mode, all that is necessary is to type )LOAD 2 DATALFNS. If the user then types DESCRIBE, a short description of the six routines presented is printed, along with instructions on how to obtain the detailed information that is available in each of the "HOW" variables.

B.  WORKSPACE  LOADING  PROCEDURES 

Each of the routines was designed to stand alone. That is, if the user desires just to use HIST, all that is necessary is to type )COPY 2 DATALFNS HISTGRP into a clear workspace. HISTGRP contains the principal routine HIST and only the additional routines necessary for HIST to operate. Thus, the user does not clutter his workspace with any unneeded functions. It is this group structure that maintains the orderliness of the workspace. In addition, the ability to copy a particular group into a clear workspace leaves more space for data and for execution of the functions.

The following is the group structure in library 2 DATALFNS.

GROUP          PRINCIPAL    OTHER NECESSARY ROUTINES
               ROUTINE

HISTGRP        HIST         APLNAME, APLOT, AUTOS, CMS, DFT, ECDF,
                            ECODE, EFT, OF, OUT, WRITE

HISTLISTGRP    HISTLIST     APLNAME, CMS, ECODE, DFT, OF, OUT, WRITE

HISTSGRP       HISTS        DFT, EFT

HISTJACKGRP    HISTJACK     DFT, EFT, TOT

EXPONPGRP      EXPONP       AND, AUTOSCALE, INITIAL, MPLOT, MSGS, VS,
                            MULTIPLOT, SETAAP, TICMARK

NORMPGRP       NORMP        AND, AUTOSCALE, INITIAL, MPLOT, MSGS, VS,
                            MULTIPLOT, SETAAP, TICMARK

GROUP                          VARIABLES

DESCGRP (Descriptive group)    DESCRIBE, HISTHOW, HISTHOW, HISTLISTHOW,
                               HISTJACKHOW, EXPONPHOW, NORMPHOW

VARIGRP (Variable group)       TELDAT1, TELDAT2, YROVR

The variable BS is also listed for two of the routine groups.


C.   ROUTINE  LISTING 

The above-mentioned routines were either created by the author, adapted from the existing FORTRAN routine HISTG/F, or borrowed from the current APL library to supplement the author-created routines.

1.  Author-Created Routines

HISTLIST, HISTS, HISTJACK, EXPONP, NORMP, APLOT, AUTOS, OUT, TOT

2.  Adapted from FORTRAN Library Routine HISTG/F

HIST, ECDF

3.  Borrowed Routines to Supplement Author-Created Routines

AND, APLNAME, AUTOSCALE, CMS, DFT, ECODE, EFT, INITIAL, MPLOT, MSGS, MULTIPLOT, NDTRI, OF, SETAAP, TICMARK, VS, WRITE

X.    COMPUTER    LISTING    OF   ALL    ROUTINES